Introduction to machine learning
This module provides the basis for the rest of the course by introducing the basic concepts behind machine learning, and, specifically, how to perform machine learning by using RStudio and the CARET R package. First, you will learn how machine learning and artificial intelligence are disrupting businesses. Next, you will learn about the basic types of machine learning and how to leverage these algorithms in a R script. Third, you will learn how linear regression can be considered a machine learning problem with parameters that must be determined computationally by minimizing a cost function. Finally, you will learn about neighbor-based algorithms, including the k-nearest neighbor algorithm, which can be used for both classification and regression tasks.
articulate the different types of machine learning algorithms and provide examples where each type might be applied within the Accounting profession,
describe the challenges in cleaning and pre-processing data,
apply the Caret Package to perform basic machine learning tasks,
understand the importance of a cost function and the importance of minimizing this functions,
perform general linear regression by using the Caret Package, and
apply the k-nearest neighbor algorithm for both classification and regression tasks.
This lesson explores the fundamentals of machine learning and artificial intelligence and how these tools are being used in accountancy and business in general.
How is accountancy and finance world using artificial intelligence
What is the growing role artificial intelligence (and by association, machine learning) play in modern accounting?
What does the impact technology plays in shaping careers in modern accountancy?
How is artificial intelligence impacting financial auditing?

Your most important skill will be your ability to translate data into insights that are clear and meaningful to a stakeholder.
This is the most common of all forms. In business, it provides the analyst with a view of key metrics and measures within the company.
An example of this could be a monthly profit and loss statement. Similarly, an analyst could have data on a large population of customers. Understanding demographic information on their customers (e.g. 30% of our customers are self-employed) would be categorised as “descriptive analytics”. Utilising useful visualisation tools enhances the message of descriptive analytics.
This is the next step in complexity in data analytics is descriptive analytics. On the assessment of the descriptive data, diagnostic analytical tools will empower an analyst to drill down and in so doing isolate the root-cause of a problem.
Well-designed business information (BI) dashboards incorporating reading of time-series data (i.e. data over multiple successive points in time) and featuring filters and drill down capability allow for such analysis.
Predictive analytics is all about forecasting. Whether it’s the likelihood of an event happening in future, forecasting a quantifiable amount or estimating a point in time at which something might happen - these are all done through predictive models.
Predictive models typically utilise a variety of variable data to make the prediction. The variability of the component data will have a relationship with what it is likely to predict (e.g. the older a person, the more susceptible they are to a heart-attack – we would say that age has a linear correlation with heart-attack risk). These data are then compiled together into a score or prediction.
In a world of significant uncertainty, being able to predict allows one to make better decisions. Predictive models are some of the most important utilised across many fields.
The next step up regarding value and complexity is the prescriptive model. The prescriptive model utilises an understanding of what has happened, why it has happened and a variety of “what-might-happen” analysis to help the user determine the best course of action to take. A prescriptive analysis is typically not just with one individual response but is, in fact, a host of other actions.
An excellent example of this is a traffic application helping you choose the best route home and taking into account the distance of each route, the speed at which one can travel on each road and, crucially, the current traffic constraints.
Another example might be producing an exam time-table such that no students have clashing schedules.
As a field, machine learning is both expansive and mathematically complex. From deriving simple linear relationships via regression analysis to finding clusters of data points in an N-dimensional space; statistical and machine learning techniques can take years to fully master. However, given the short time available in this course, we will take the simpler approach of demonstrating several commonly used approaches in order to both introduce the fundamental concepts in machine learning and the methodology we will use in RStudio to apply these concepts to actual data. For the latter, we will use the standard machine learning library in R, which is the Caret Package.
We will demonstrate the four main tasks of machine learning: classification, regression, dimensional reduction, and clustering. Note that this module is simply an introduction to these topics, we will explore these and other areas in more detail throughout this course. Finally, we will discuss how to persist machine learning models.
Make sure you have the caret package installed.
install.packages("caret")
Run your libraries
The first steps in any data analytics effort, once the business goal has been defined, are to understand and prepare the data of interest. For example, the Cross Industry Standard Process for Data Mining or (CRISP-DM) starts with the Business Understanding step, immediately followed by the Data Understanding and Data Preparation steps. For machine learning analyses, these latter two steps require loading the data into our notebook, exploring the data either systematically or in a cumulative sense to understand the typical features for different instances. We also can generate descriptive statistical summaries and visualizations, such as a pair plot, to understand the data in full. Finally, we will need to clean the data to properly account for missing data, data that are incomplete or formatted incorrectly, or to generate meta-features (such as a date-time) from existing features.

For this module, we will focus on a single, simple data set, the standard Iris dataset, which is included by default. Note that given a data set, such as the Iris data, we have rows, which correspond to different instances (e.g., different flowers), and columns, which correspond to different features of the instances (e.g., different measurements of the flowers). To understand the data, we first load this data into RStudio, before looking at several instances from the data. Next, we will group the data by species to explore cumulative quantities, before extracting a statistical summary of the entire data set. Finally, we will generate a pair plot to visually explore the data. Since this data has already been cleaned (and only consists of four features) we will not need to perform additional tasks.
Load the data
#load the data
iris<-iris
The data set consists of 150 total measurements of three different types of Iris flowers, equally divided between three classes: Iris setosa, Iris versicolor, and Iris virginica. Before proceeding, we can examine the DataFrame that contains these data to view typical instances, to see a cumulative summary, and a brief statistical summary.
#examine the top 5 rows
head(iris,5)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
#view the whole dataset
knitr::kable(iris)%>%
kableExtra::kable_styling("striped")%>%
kableExtra::scroll_box(width = "100%",height="300px")
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| 5.4 | 3.7 | 1.5 | 0.2 | setosa |
| 4.8 | 3.4 | 1.6 | 0.2 | setosa |
| 4.8 | 3.0 | 1.4 | 0.1 | setosa |
| 4.3 | 3.0 | 1.1 | 0.1 | setosa |
| 5.8 | 4.0 | 1.2 | 0.2 | setosa |
| 5.7 | 4.4 | 1.5 | 0.4 | setosa |
| 5.4 | 3.9 | 1.3 | 0.4 | setosa |
| 5.1 | 3.5 | 1.4 | 0.3 | setosa |
| 5.7 | 3.8 | 1.7 | 0.3 | setosa |
| 5.1 | 3.8 | 1.5 | 0.3 | setosa |
| 5.4 | 3.4 | 1.7 | 0.2 | setosa |
| 5.1 | 3.7 | 1.5 | 0.4 | setosa |
| 4.6 | 3.6 | 1.0 | 0.2 | setosa |
| 5.1 | 3.3 | 1.7 | 0.5 | setosa |
| 4.8 | 3.4 | 1.9 | 0.2 | setosa |
| 5.0 | 3.0 | 1.6 | 0.2 | setosa |
| 5.0 | 3.4 | 1.6 | 0.4 | setosa |
| 5.2 | 3.5 | 1.5 | 0.2 | setosa |
| 5.2 | 3.4 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.6 | 0.2 | setosa |
| 4.8 | 3.1 | 1.6 | 0.2 | setosa |
| 5.4 | 3.4 | 1.5 | 0.4 | setosa |
| 5.2 | 4.1 | 1.5 | 0.1 | setosa |
| 5.5 | 4.2 | 1.4 | 0.2 | setosa |
| 4.9 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.2 | 1.2 | 0.2 | setosa |
| 5.5 | 3.5 | 1.3 | 0.2 | setosa |
| 4.9 | 3.6 | 1.4 | 0.1 | setosa |
| 4.4 | 3.0 | 1.3 | 0.2 | setosa |
| 5.1 | 3.4 | 1.5 | 0.2 | setosa |
| 5.0 | 3.5 | 1.3 | 0.3 | setosa |
| 4.5 | 2.3 | 1.3 | 0.3 | setosa |
| 4.4 | 3.2 | 1.3 | 0.2 | setosa |
| 5.0 | 3.5 | 1.6 | 0.6 | setosa |
| 5.1 | 3.8 | 1.9 | 0.4 | setosa |
| 4.8 | 3.0 | 1.4 | 0.3 | setosa |
| 5.1 | 3.8 | 1.6 | 0.2 | setosa |
| 4.6 | 3.2 | 1.4 | 0.2 | setosa |
| 5.3 | 3.7 | 1.5 | 0.2 | setosa |
| 5.0 | 3.3 | 1.4 | 0.2 | setosa |
| 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
| 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
| 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
| 5.5 | 2.3 | 4.0 | 1.3 | versicolor |
| 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
| 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
| 6.3 | 3.3 | 4.7 | 1.6 | versicolor |
| 4.9 | 2.4 | 3.3 | 1.0 | versicolor |
| 6.6 | 2.9 | 4.6 | 1.3 | versicolor |
| 5.2 | 2.7 | 3.9 | 1.4 | versicolor |
| 5.0 | 2.0 | 3.5 | 1.0 | versicolor |
| 5.9 | 3.0 | 4.2 | 1.5 | versicolor |
| 6.0 | 2.2 | 4.0 | 1.0 | versicolor |
| 6.1 | 2.9 | 4.7 | 1.4 | versicolor |
| 5.6 | 2.9 | 3.6 | 1.3 | versicolor |
| 6.7 | 3.1 | 4.4 | 1.4 | versicolor |
| 5.6 | 3.0 | 4.5 | 1.5 | versicolor |
| 5.8 | 2.7 | 4.1 | 1.0 | versicolor |
| 6.2 | 2.2 | 4.5 | 1.5 | versicolor |
| 5.6 | 2.5 | 3.9 | 1.1 | versicolor |
| 5.9 | 3.2 | 4.8 | 1.8 | versicolor |
| 6.1 | 2.8 | 4.0 | 1.3 | versicolor |
| 6.3 | 2.5 | 4.9 | 1.5 | versicolor |
| 6.1 | 2.8 | 4.7 | 1.2 | versicolor |
| 6.4 | 2.9 | 4.3 | 1.3 | versicolor |
| 6.6 | 3.0 | 4.4 | 1.4 | versicolor |
| 6.8 | 2.8 | 4.8 | 1.4 | versicolor |
| 6.7 | 3.0 | 5.0 | 1.7 | versicolor |
| 6.0 | 2.9 | 4.5 | 1.5 | versicolor |
| 5.7 | 2.6 | 3.5 | 1.0 | versicolor |
| 5.5 | 2.4 | 3.8 | 1.1 | versicolor |
| 5.5 | 2.4 | 3.7 | 1.0 | versicolor |
| 5.8 | 2.7 | 3.9 | 1.2 | versicolor |
| 6.0 | 2.7 | 5.1 | 1.6 | versicolor |
| 5.4 | 3.0 | 4.5 | 1.5 | versicolor |
| 6.0 | 3.4 | 4.5 | 1.6 | versicolor |
| 6.7 | 3.1 | 4.7 | 1.5 | versicolor |
| 6.3 | 2.3 | 4.4 | 1.3 | versicolor |
| 5.6 | 3.0 | 4.1 | 1.3 | versicolor |
| 5.5 | 2.5 | 4.0 | 1.3 | versicolor |
| 5.5 | 2.6 | 4.4 | 1.2 | versicolor |
| 6.1 | 3.0 | 4.6 | 1.4 | versicolor |
| 5.8 | 2.6 | 4.0 | 1.2 | versicolor |
| 5.0 | 2.3 | 3.3 | 1.0 | versicolor |
| 5.6 | 2.7 | 4.2 | 1.3 | versicolor |
| 5.7 | 3.0 | 4.2 | 1.2 | versicolor |
| 5.7 | 2.9 | 4.2 | 1.3 | versicolor |
| 6.2 | 2.9 | 4.3 | 1.3 | versicolor |
| 5.1 | 2.5 | 3.0 | 1.1 | versicolor |
| 5.7 | 2.8 | 4.1 | 1.3 | versicolor |
| 6.3 | 3.3 | 6.0 | 2.5 | virginica |
| 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 7.1 | 3.0 | 5.9 | 2.1 | virginica |
| 6.3 | 2.9 | 5.6 | 1.8 | virginica |
| 6.5 | 3.0 | 5.8 | 2.2 | virginica |
| 7.6 | 3.0 | 6.6 | 2.1 | virginica |
| 4.9 | 2.5 | 4.5 | 1.7 | virginica |
| 7.3 | 2.9 | 6.3 | 1.8 | virginica |
| 6.7 | 2.5 | 5.8 | 1.8 | virginica |
| 7.2 | 3.6 | 6.1 | 2.5 | virginica |
| 6.5 | 3.2 | 5.1 | 2.0 | virginica |
| 6.4 | 2.7 | 5.3 | 1.9 | virginica |
| 6.8 | 3.0 | 5.5 | 2.1 | virginica |
| 5.7 | 2.5 | 5.0 | 2.0 | virginica |
| 5.8 | 2.8 | 5.1 | 2.4 | virginica |
| 6.4 | 3.2 | 5.3 | 2.3 | virginica |
| 6.5 | 3.0 | 5.5 | 1.8 | virginica |
| 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 7.7 | 2.6 | 6.9 | 2.3 | virginica |
| 6.0 | 2.2 | 5.0 | 1.5 | virginica |
| 6.9 | 3.2 | 5.7 | 2.3 | virginica |
| 5.6 | 2.8 | 4.9 | 2.0 | virginica |
| 7.7 | 2.8 | 6.7 | 2.0 | virginica |
| 6.3 | 2.7 | 4.9 | 1.8 | virginica |
| 6.7 | 3.3 | 5.7 | 2.1 | virginica |
| 7.2 | 3.2 | 6.0 | 1.8 | virginica |
| 6.2 | 2.8 | 4.8 | 1.8 | virginica |
| 6.1 | 3.0 | 4.9 | 1.8 | virginica |
| 6.4 | 2.8 | 5.6 | 2.1 | virginica |
| 7.2 | 3.0 | 5.8 | 1.6 | virginica |
| 7.4 | 2.8 | 6.1 | 1.9 | virginica |
| 7.9 | 3.8 | 6.4 | 2.0 | virginica |
| 6.4 | 2.8 | 5.6 | 2.2 | virginica |
| 6.3 | 2.8 | 5.1 | 1.5 | virginica |
| 6.1 | 2.6 | 5.6 | 1.4 | virginica |
| 7.7 | 3.0 | 6.1 | 2.3 | virginica |
| 6.3 | 3.4 | 5.6 | 2.4 | virginica |
| 6.4 | 3.1 | 5.5 | 1.8 | virginica |
| 6.0 | 3.0 | 4.8 | 1.8 | virginica |
| 6.9 | 3.1 | 5.4 | 2.1 | virginica |
| 6.7 | 3.1 | 5.6 | 2.4 | virginica |
| 6.9 | 3.1 | 5.1 | 2.3 | virginica |
| 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 6.8 | 3.2 | 5.9 | 2.3 | virginica |
| 6.7 | 3.3 | 5.7 | 2.5 | virginica |
| 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 5.9 | 3.0 | 5.1 | 1.8 | virginica |
#examine grouped data
iris%>%
group_by(Species)%>%
summarise(count=n())
# A tibble: 3 x 2
Species count
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
# Get descriptive statistics
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Another handy package
install.packages("psych")
psych::describe(iris)
vars n mean sd median trimmed mad min max range
Sepal.Length 1 150 5.84 0.83 5.80 5.81 1.04 4.3 7.9 3.6
Sepal.Width 2 150 3.06 0.44 3.00 3.04 0.44 2.0 4.4 2.4
Petal.Length 3 150 3.76 1.77 4.35 3.76 1.85 1.0 6.9 5.9
Petal.Width 4 150 1.20 0.76 1.30 1.18 1.04 0.1 2.5 2.4
Species* 5 150 2.00 0.82 2.00 2.00 1.48 1.0 3.0 2.0
skew kurtosis se
Sepal.Length 0.31 -0.61 0.07
Sepal.Width 0.31 0.14 0.04
Petal.Length -0.27 -1.42 0.14
Petal.Width -0.10 -1.36 0.06
Species* 0.00 -1.52 0.07
# ?psych::describe
Look up what mad, se, skewness, and kurtosis is… 🕵️
As demonstrated by the output from the previous code cells, our test data matches our expectations (note that the full Iris data set is listed on Wikipedia). These data consist of three types, each with fifty instances, and every row has four measured features (i.e., attributes). The four primary features of the data are Sepal Length, Sepal Width, Petal Length, and Petal Width. In simple terms, petals are the showy, colorful part of the Iris flower, while the sepals provide protection and support for the petals.
In addition, our cursory exploration of the DataFrame indicated the data are clean. One simple way to verify this is that the count is the same for every feature, and the descriptive statistics (e.g., min, max, and mean) are all numerical. If we had missing or bad data in our DataFrame, these measures would generally indicate the problem. If there were missing data, we could drop any instance with missing data by using the na.omit method, or alternatively insert a value by using mutate . An alternative, and powerful, technique for handling missing data is known as imputing, where we apply machine learning2 to generate realistic values for any missing data. This approach will be demonstrated in a subsequent module.
At this point, we have loaded our data, and verified the data are clean. The next step is to visualize the relationships between the different features in our data.
Lets use another package 😆
install.packages("GGally")
Look at me 👀 ggpairs function 👀
GGally::ggpairs(iris)

GGally::ggpairs(iris, aes(color = Species, alpha = 0.5))

These figures indicate that the three Iris species cluster naturally in these dimensions, with minimal overlap. As a result, these data provide an excellent test for different machine learning algorithms.
First, however, we will generate one scatter plot that displays a larger version of the Sepal Width versus Petal Width scatter plot to highlight the inherent structure in these data. Furthermore, we will refer back to this plot in later analyses in this Module.
#crash course in plots?
ggplot(data=iris,mapping = aes(x=Sepal.Width,y=Petal.Width,color=Species))+geom_point(alpha=0.5)

#line graph..this is a TERRIBLE example...
#ask me why
iris.setosa<-iris%>%
filter(Species=="setosa")%>%
distinct(Sepal.Width,.keep_all = TRUE)
ggplot(data=iris.setosa,mapping = aes(x=Sepal.Width,y=Sepal.Length))+geom_line()

ggplot(data=iris,mapping = aes(x=Sepal.Length,color=Species))+geom_histogram()

ggplot(data=iris,mapping = aes(x=Species,y=Sepal.Length))+geom_boxplot()

ggplot(data=iris,mapping = aes(x=Species))+geom_bar()

Machine learning algorithms can be classified by the method in which they are constructed. Supervised learning methods use training data to build a model, which is subsequently applied to additional data. On the other hand, unsupervised learning methods seek relationships among data points that can be leveraged to construct a model that is subsequently applied to the data of interest. In some cases, training data are used to validate the effectiveness of an unsupervised method, or perhaps to provide some level of supervision, which is known as semi-supervised learning.
More recently, additional types of learning have been developed. First, transfer learning extends a model trained on a previous data set to new, related data. This can be viewed as learning by analogy, which is similar to how humans learn. Second, reinforcement learning is a technique that explores how agents should behave within an environment by maximizing a cumulative reward. Finally, deep learning applies artificial neural networks that have multiple hidden layers to complex tasks, often with spectacular success in areas from image recognition to natural language processing.
Broadly speaking, the application of a machine learning algorithm will be one of four different categories:
Classification: generates a model that predicts discrete categories for new, unseen data.
Regression: generates a model that predicts continuous values for new, unseen data.
Dimensionality reduction: identifies (and optionally ranks) the most important (potentially new) features (or dimensions) for a data set.
Clustering: identifies clusters of instances in an N-dimensional feature space.

One final point to clarify before proceeding with demonstrations of these different algorithm categories. When applying a machine learning algorithm to a problem, we often need to specify both model parameters and model hyperparameters. While they are similar, the difference between these two types of information depends on whether the value can be estimated from the data.
A value that can be estimated from the data being analyzed and that is internal to the machine learning algorithm. A parameter is generally not specified by the programmer, and instead is determined automatically by the algorithm implementation (e.g., directly in the caret package). For example, the coefficients in a linear regression model are machine learning parameters.
A value that cannot be estimated from the data being analyzed and that is external to a specific machine learning algorithm. A hyperparameter is generally specified by the programmer prior to the start of the learning process. As a result, the hyperparameter directly influences the performance of the algorithm and thus is a tunable parameter. For example, the number of neighbors in a k-nearest neighbors implementation is a hyperparameter.
Caret stands for Classification And Regression Training. Apparently caret has little to do with our orange friend, the carrot. 🥕
Not only does caret allow you to run a plethora of ML methods, it also provides tools for auxiliary techniques such as:
Data preparation (imputation, centering/scaling data, removing correlated predictors, reducing skewness)
Data splitting
Variable selection
Model evaluation
An extensive vignette for caret can be found here: https://topepo.github.io/caret/index.html
Before we can apply a machine learning algorithm to the data of interest, we must divide the data into training and testing data sets. The training data are used to generate the supervised model, while the testing data are used to quantify the quality of the generated model. The function createDataPartition can be used to create balanced splits of the data. If the y argument to this function is a factor, the random sampling occurs within each class and should preserve the overall class distribution of the data. For example, to create a single 60/40% split of the iris data:
#lets split the data 60/40
library(caret)
trainIndex <- createDataPartition(iris$Species, p = .6, list = FALSE, times = 1)
#look at the first few
head(trainIndex)
Resample1
[1,] 1
[2,] 2
[3,] 4
[4,] 5
[5,] 7
[6,] 8
#grab the data
irisTrain <- iris[ trainIndex,]
irisTest <- iris[-trainIndex,]
Many machine learning estimators are sensitive to variations in the spread of features within a data set. For example, if all features but one span similar ranges (e.g., zero through one) and one feature spans a much larger range (e.g., zero through one hundred), an algorithm might focus on the one feature with a larger spread, even if this produces a sub-optimal result. To prevent this, we generally scale the features to improve the performance of a given estimator.
Data scaling can take several forms:
Standardization: the data are scaled to have zero mean and unit (i.e., one) variance.
Normalization: the data are scaled to have unit mean and variance.
Range: the data are scaled to span a defined range, such as \[0,1\].
Binarization: the data are thresholded such that values below the threshold are zero (or False), and above the threshold are one (or True).
One important caveat to scaling is that any scaling technique should be trained via the fit method on the training data used for the machine learning algorithm. Once trained, the scaling technique can be applied equally to the training and testing data. In this manner, the testing data will always match the space spanned by the training data, which is what is used to generate the predictive model.
We demonstrate this approach in the following code cell, where we compute a standardization from our training data. This transformation is applied to both the training and testing data.
preProcValues <- preProcess(irisTrain, method = c("center", "scale"))
trainTransformed <- predict(preProcValues, irisTrain)
testTransformed <- predict(preProcValues, irisTest)
I made a mistake here…Can you spot it 👀
preProcValues <- preProcess(irisTest, method = c("center", "scale"))
testTransformed <- predict(preProcValues, irisTest)
psych::describe(trainTransformed)
vars n mean sd median trimmed mad min max range
Sepal.Length 1 90 0 1.00 -0.11 -0.05 1.04 -1.75 2.46 4.22
Sepal.Width 2 90 0 1.00 -0.08 -0.04 0.85 -1.93 3.14 5.07
Petal.Length 3 90 0 1.00 0.36 0.00 1.01 -1.50 1.67 3.17
Petal.Width 4 90 0 1.00 0.14 -0.01 1.39 -1.46 1.75 3.20
Species* 5 90 2 0.82 2.00 2.00 1.48 1.00 3.00 2.00
skew kurtosis se
Sepal.Length 0.35 -0.58 0.11
Sepal.Width 0.45 0.23 0.11
Petal.Length -0.28 -1.44 0.11
Petal.Width -0.16 -1.37 0.11
Species* 0.00 -1.53 0.09
psych::describe(testTransformed)
vars n mean sd median trimmed mad min max range
Sepal.Length 1 60 0 1.00 -0.14 -0.03 1.13 -1.66 2.26 3.92
Sepal.Width 2 60 0 1.00 0.03 -0.02 1.01 -2.47 2.52 5.00
Petal.Length 3 60 0 1.00 0.28 0.00 1.00 -1.55 1.77 3.32
Petal.Width 4 60 0 1.00 0.11 -0.02 1.41 -1.41 1.64 3.05
Species* 5 60 2 0.82 2.00 2.00 1.48 1.00 3.00 2.00
skew kurtosis se
Sepal.Length 0.27 -0.75 0.13
Sepal.Width 0.11 0.00 0.13
Petal.Length -0.25 -1.43 0.13
Petal.Width -0.03 -1.40 0.13
Species* 0.00 -1.55 0.11
With our data properly divided into training and testing samples, and the features appropriately scaled, we now change to the application of machine learning algorithms
The first type of algorithm we will demonstrate is classification, where we train an estimator to generate a model for the prediction of discrete labels. The following code cell completes this task by performing k-Nearest Neighbors classification. In this example, we use five nearest neighbors (but this value can be easily adjusted to see how the classification performance changes). As demonstrated in this code example, the standard classification process in caret is to first fit a model to the training data and to subsequently apply this model to predict values for the testing data. We can compute an accuracy measurement for our trained algorithm to compare the predicted and known labels for the testing data.
Since we set the k there is no reason to actually train… 😖
#fit knn
knn_fit<-train(Species~.,
data=trainTransformed,
method="knn",
tuneGrid=data.frame(k=5))
knn_fit
k-Nearest Neighbors
90 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...
Resampling results:
Accuracy Kappa
0.9214823 0.8808281
Tuning parameter 'k' was held constant at a value of 5
#predict on the test set
knn_pred<-predict(knn_fit,testTransformed)
#confusion
confusionMatrix(knn_pred,testTransformed$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 20 0 0
versicolor 0 20 1
virginica 0 0 19
Overall Statistics
Accuracy : 0.9833
95% CI : (0.9106, 0.9996)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.975
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 1.0000 0.9500
Specificity 1.0000 0.9750 1.0000
Pos Pred Value 1.0000 0.9524 1.0000
Neg Pred Value 1.0000 1.0000 0.9756
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3333 0.3167
Detection Prevalence 0.3333 0.3500 0.3167
Balanced Accuracy 1.0000 0.9875 0.9750
The second machine learning application we will demonstrate is regression. To demonstrate regression, we will introduce the Decision Tree. A decision tree simply asks a set of questions of the data, and based on the answers, constructs a model representation. The tree (or model) is constructed by recursively splitting a data set into new groupings based on a statistical measure of the data along each different dimension (popular measures include the Gini coefficient or the entropy).
The terminal nodes in the tree are known as leaf nodes, and they provide the final predictions. In the simplest form, the leaf node simply provides the final prediction. More realistic decision trees generate a model prediction by using all instances in the leaf node, for example by averaging across them.
Before generating a regression model, however, we must pre-process our data to identify our independent variables (or features) and our dependent variable (or feature). Given a set of new independent variables, a regression model will predict the dependent variable. In the following code cell, we first select the first three features to be our independent variables and the fourth variable to be our dependent variable. We divide these into training and testing samples.
#fit Decision Tree
DT_fit1<-train(Petal.Width~Sepal.Length+Sepal.Width+Petal.Length,
data=trainTransformed,
method="rpart")
DT_fit1
CART
90 samples
3 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...
Resampling results across tuning parameters:
cp RMSE Rsquared MAE
0.01387911 0.2907516 0.9197935 0.2216197
0.11645306 0.3779289 0.8533688 0.2918127
0.81157569 0.6976834 0.7887978 0.5822100
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.01387911.
Install another package
install.packages("rpart.plot")
We can plot simple trees
rpart.plot::prp(DT_fit1$finalModel,box.palette = "Reds", tweak = 1.2)

Lets predict
#predict on the test set
DTfit1_pred<-predict(DT_fit1,testTransformed)
preds<-DTfit1_pred
actual<-testTransformed$Petal.Width
#one formulation
rss <- sum((preds - actual) ^ 2) ## residual sum of squares
tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
rsq <- 1 - rss/tss
rsq
[1] 0.9049342
[1] 0.9215393
#another
cor(preds,actual)^2
[1] 0.905095
R-squared is a statistic that often accompanies regression output. It ranges in value from 0 to 1 and is usually interpreted as summarizing the percent of variation in the response that the regression model explains. So an R-squared of 0.65 might mean that the model explains about 65% of the variation in our dependent variable. Given this logic, we prefer our regression models have a high R-squared.
In R, we typically get R-squared by calling the summary function on a model object. Here’s a quick example using simulated data:
# independent variable
x <- 1:20
# for reproducibility
set.seed(1)
# dependent variable; function of x with random error
y <- 2 + 0.5*x + rnorm(20,0,3)
# simple linear regression
mod <- lm(y~x)
# request just the r-squared value
summary(mod)$r.squared
[1] 0.6026682
One way to express R-squared is as the sum of squared fitted-value deviations divided by the sum of squared original-value deviations:
\[ R^{2} = \frac{\sum (\hat{y} – \bar{\hat{y}})^{2}}{\sum (y – \bar{y})^{2}} \]
We can calculate it directly using our model object like so:
# extract fitted (or predicted) values from model
f <- mod$fitted.values
# sum of squared fitted-value deviations
mss <- sum((f - mean(f))^2)
# sum of squared original-value deviations
tss <- sum((y - mean(y))^2)
# r-squared
mss/tss
[1] 0.6026682
1. R-squared does not measure goodness of fit. It can be arbitrarily low when the model is completely correct. By making\(σ^2\) large, we drive R-squared towards 0, even when every assumption of the simple linear regression model is correct in every particular.
What is \(σ^2\)? When we perform linear regression, we assume our model almost predicts our dependent variable. The difference between “almost” and “exact” is assumed to be a draw from a Normal distribution with mean 0 and some variance we call \(σ^2\).
This statement is easy enough to demonstrate. The way we do it here is to create a function that (1) generates data meeting the assumptions of simple linear regression (independent observations, normally distributed errors with constant variance), (2) fits a simple linear model to the data, and (3) reports the R-squared. Notice the only parameter for sake of simplicity is sigma. We then “apply” this function to a series of increasing \(σ\) values and plot the results.
r2.0 <- function(sig){
# our predictor
x <- seq(1,10,length.out = 100)
# our response; a function of x plus some random noise
y <- 2 + 1.2*x + rnorm(100,0,sd = sig)
# print the R-squared value
summary(lm(y ~ x))$r.squared
}
sigmas <- seq(0.5,20,length.out = 20)
# apply our function to a series of sigma values
rout <- sapply(sigmas, r2.0)
plot(rout ~ sigmas, type="b")

R-squared tanks hard with increasing sigma, even though the model is completely correct in every respect.
The point being made is that R-squared does not measure goodness of fit.
set.seed(1)
# our predictor is data from an exponential distribution
x <- rexp(50,rate=0.005)
# non-linear data generation
y <- (x-1)^2 * runif(50, min=0.8, max=1.2)
# clearly non-linear
plot(x,y)

It’s very high at about 0.85, but the model is completely wrong. Using R-squared to justify the “goodness” of our model in this instance would be a mistake. Hopefully one would plot the data first and recognize that a simple linear regression in this case would be inappropriate.
3. R-squared says nothing about prediction error, even with \(σ^2\) exactly the same, and no change in the coefficients. R-squared can be anywhere between 0 and 1 just by changing the range of X. We’re better off using Mean Square Error (MSE) as a measure of prediction error.
MSE is basically the fitted y values minus the observed y values, squared, then summed, and then divided by the number of observations.
Let’s demonstrate this statement by first generating data that meets all simple linear regression assumptions and then regressing y on x to assess both R-squared and MSE.
x <- seq(1,10,length.out = 100)
set.seed(1)
y <- 2 + 1.2*x + rnorm(100,0,sd = 0.9)
mod1 <- lm(y ~ x)
summary(mod1)$r.squared
[1] 0.9383379
[1] 0.6468052
Now repeat the above code, but this time with a different range of x. Leave everything else the same:
# new range of x
x <- seq(1,2,length.out = 100)
set.seed(1)
y <- 2 + 1.2*x + rnorm(100,0,sd = 0.9)
mod1 <- lm(y ~ x)
summary(mod1)$r.squared
[1] 0.1502448
[1] 0.6468052
The R-squared falls from 0.94 to 0.15 but the MSE remains the same. In other words the predictive ability is the same for both data sets, but the R-squared would lead you to believe the first example somehow had a model with more predictive power.
Let’s examine this by generating data that would benefit from transformation. Notice the R code below is very much like our previous efforts but now we exponentiate our y variable.
x <- seq(1,2,length.out = 100)
set.seed(1)
y <- exp(-2 - 0.09*x + rnorm(100,0,sd = 2.5))
summary(lm(y ~ x))$r.squared
[1] 0.003281718

R-squared is very low and our residuals vs. fitted plot reveals outliers and non-constant variance. A common fix for this is to log transform the data. Let’s try that and see what happens:
The diagnostic plot looks much better. Our assumption of constant variance appears to be met. But look at the R-squared:
It’s even lower! This is an extreme case and it doesn’t always happen like this. In fact, a log transformation will usually produce an increase in R-squared. But as just demonstrated, assumptions that are better fulfilled don’t always lead to higher R-squared.
This is the easiest statement to demonstrate:
[1] 0.737738
[1] 0.737738
Does x explain y, or does y explain x? Are we saying “explain” to dance around the word “cause”? In a simple scenario with two variables such as this, R-squared is simply the square of the correlation between x and y:
Let’s recap:
R-squared does not measure goodness of fit.
R-squared does not measure predictive error.
R-squared does not necessarily increase when assumptions are better satisfied.
R-squared does not measure how one variable explains another.
When confronted with a large, multi-dimensional data set, one approach to simplify any subsequent analysis is to reduce the number of dimensions (or features) that must be processed. In some cases, features can be removed from an analysis based on business logic, or the features that contain the most information can be quantified somehow. More generally, however, we can employ dimensional reduction, a machine learning technique that quantifies relationships between the original dimensions (or features, attributes, or columns of a DataFrame) to identify new dimensions that better capture the inherent relationships within the data.
The standard technique to perform this is known as principal component analysis, or PCA. Mathematically, we can derive PCA by using linear algebra to solve a set of linear equations. This process effectively rotates the data into a new set of dimensions, and by ranking the importance of the new dimensions, we can optimally select fewer dimensions for use in other machine learning algorithms.
The PCA estimator requires one tunable hyper-parameter that specifies the target number of dimensions. This value can be arbitrarily selected, perhaps based on prior information, or it can be iteratively determined. After the model is created, we fit the model to the data and next create our new, rotated data set. This is demonstrated in the next code cell.
library(caret)
#store our data in another object
dat <- iris
#take the 4 continuous variables and perform PCA
caret.pca <- preProcess(dat[,-5], method="pca",pcaComp=2)
caret.pca
Created from 150 samples and 4 variables
Pre-processing:
- centered (4)
- ignored (0)
- principal component signal extraction (4)
- scaled (4)
PCA used 2 components as specified
caret.pca$
#use that data to form our new inputs
dat2 <- predict(caret.pca, dat[,-5])
#using stats
stat.pca <- prcomp(dat[,-5],
center = TRUE,
scale. = TRUE)
# plot method
plot(stat.pca, type = "l")
summary(stat.pca)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
Below is a graphical representation5

At the end of the previous code cell, we measure the amount of the original variance (or spread) in the original data that is captured by each new dimension. As this example shows, these two new dimensions capture almost 96% of the variance in the original data. This means that any analysis that uses only these two new dimensions will closely represent the analysis if performed on the entire data.
The last machine learning technique we will explore in this notebook is cluster finding. In this introductory notebook, we will demonstrate one of the simplest clustering techniques, spatial clustering, which seeks to first find NN clusters in a data set and to subsequently identify to which cluster each instance (or data point) belongs. The specific algorithm we employ below is the k-means algorithm, which is one of the simplest to understand. In this algorithm, we start with a guess for the number of clusters (again this can be based on prior information or iteratively quantified). We randomly place cluster centers in the data and determine how well the data cluster to these cluster centers. This information is used to pick new cluster centers, and the process continues until a solution converges (or we reach a predefined number of iterations).
Clusters<-kmeans(trainTransformed[,-5],centers=3)
Clusters
K-means clustering with 3 clusters of sizes 37, 23, 30
Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 0.07988207 -0.8067389 0.4033988 0.3820683
2 1.21087562 0.3061693 1.0479522 1.0378882
3 -1.02685919 0.7602481 -1.3009553 -1.2669319
Clustering vector:
1 2 4 5 7 8 9 14 16 17 20 21 22 24 25 26 27
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
29 30 31 33 36 38 39 40 41 42 43 46 47 51 54 56 57
3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 1 2
58 59 60 62 63 64 65 66 70 73 75 77 78 79 81 83 84
1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1
85 86 88 89 91 92 97 99 100 102 103 104 105 106 107 108 109
1 2 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1
110 111 114 115 117 118 120 122 124 125 126 127 128 129 132 133 134
2 2 1 1 2 2 1 1 1 2 2 1 1 2 2 2 1
136 137 138 141 147
2 2 2 2 1
Within cluster sum of squares by cluster:
[1] 31.46525 24.54995 30.43175
(between_SS / total_SS = 75.7 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
The above list is an output of the kmeans() function. Let’s see some of the important ones closely:
cluster: a vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers: a matrix of cluster centers.
withinss: vector of within-cluster sum of squares, one component per cluster.
tot.withinss: total within-cluster sum of squares. That is, sum(withinss).
size: the number of points in each cluster.
library(tidyverse)
Clusterdata<-trainTransformed
Clusterdata$Cluster<-as.factor(Clusters$cluster)
#view the whole dataset
knitr::kable(Clusterdata)%>%
kableExtra::kable_styling("striped")%>%
kableExtra::scroll_box(width = "100%",height="300px")
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Cluster | |
|---|---|---|---|---|---|---|
| 1 | -0.8160212 | 1.0674191 | -1.3349260 | -1.3247894 | setosa | 3 |
| 2 | -1.0502856 | -0.0844720 | -1.3349260 | -1.3247894 | setosa | 3 |
| 4 | -1.4016823 | 0.1459062 | -1.2783081 | -1.3247894 | setosa | 3 |
| 5 | -0.9331534 | 1.2977973 | -1.3349260 | -1.3247894 | setosa | 3 |
| 7 | -1.4016823 | 0.8370409 | -1.3349260 | -1.1912720 | setosa | 3 |
| 8 | -0.9331534 | 0.8370409 | -1.2783081 | -1.3247894 | setosa | 3 |
| 9 | -1.6359468 | -0.3148502 | -1.3349260 | -1.3247894 | setosa | 3 |
| 14 | -1.7530790 | -0.0844720 | -1.5047800 | -1.4583068 | setosa | 3 |
| 16 | -0.1132278 | 3.1408231 | -1.2783081 | -1.0577546 | setosa | 3 |
| 17 | -0.4646245 | 1.9889320 | -1.3915440 | -1.0577546 | setosa | 3 |
| 20 | -0.8160212 | 1.7585538 | -1.2783081 | -1.1912720 | setosa | 3 |
| 21 | -0.4646245 | 0.8370409 | -1.1650721 | -1.3247894 | setosa | 3 |
| 22 | -0.8160212 | 1.5281755 | -1.2783081 | -1.0577546 | setosa | 3 |
| 24 | -0.8160212 | 0.6066626 | -1.1650721 | -0.9242372 | setosa | 3 |
| 25 | -1.1674179 | 0.8370409 | -1.0518362 | -1.3247894 | setosa | 3 |
| 26 | -0.9331534 | -0.0844720 | -1.2216901 | -1.3247894 | setosa | 3 |
| 27 | -0.9331534 | 0.8370409 | -1.2216901 | -1.0577546 | setosa | 3 |
| 29 | -0.6988890 | 0.8370409 | -1.3349260 | -1.3247894 | setosa | 3 |
| 30 | -1.2845501 | 0.3762844 | -1.2216901 | -1.3247894 | setosa | 3 |
| 31 | -1.1674179 | 0.1459062 | -1.2216901 | -1.3247894 | setosa | 3 |
| 33 | -0.6988890 | 2.4496884 | -1.2783081 | -1.4583068 | setosa | 3 |
| 36 | -0.9331534 | 0.3762844 | -1.4481620 | -1.3247894 | setosa | 3 |
| 38 | -1.0502856 | 1.2977973 | -1.3349260 | -1.4583068 | setosa | 3 |
| 39 | -1.6359468 | -0.0844720 | -1.3915440 | -1.3247894 | setosa | 3 |
| 40 | -0.8160212 | 0.8370409 | -1.2783081 | -1.3247894 | setosa | 3 |
| 41 | -0.9331534 | 1.0674191 | -1.3915440 | -1.1912720 | setosa | 3 |
| 42 | -1.5188145 | -1.6971196 | -1.3915440 | -1.1912720 | setosa | 3 |
| 43 | -1.6359468 | 0.3762844 | -1.3915440 | -1.3247894 | setosa | 3 |
| 46 | -1.1674179 | -0.0844720 | -1.3349260 | -1.1912720 | setosa | 3 |
| 47 | -0.8160212 | 1.7585538 | -1.2216901 | -1.3247894 | setosa | 3 |
| 51 | 1.4094911 | 0.3762844 | 0.5334671 | 0.2774195 | versicolor | 2 |
| 54 | -0.3474923 | -1.6971196 | 0.1371413 | 0.1439021 | versicolor | 1 |
| 56 | -0.1132278 | -0.5452285 | 0.4202312 | 0.1439021 | versicolor | 1 |
| 57 | 0.5895655 | 0.6066626 | 0.5334671 | 0.5444543 | versicolor | 2 |
| 58 | -1.0502856 | -1.4667413 | -0.2591845 | -0.2566501 | versicolor | 1 |
| 59 | 0.9409622 | -0.3148502 | 0.4768492 | 0.1439021 | versicolor | 1 |
| 60 | -0.6988890 | -0.7756067 | 0.0805233 | 0.2774195 | versicolor | 1 |
| 62 | 0.1210366 | -0.0844720 | 0.2503773 | 0.4109369 | versicolor | 1 |
| 63 | 0.2381689 | -1.9274978 | 0.1371413 | -0.2566501 | versicolor | 1 |
| 64 | 0.3553011 | -0.3148502 | 0.5334671 | 0.2774195 | versicolor | 1 |
| 65 | -0.2303600 | -0.3148502 | -0.0893306 | 0.1439021 | versicolor | 1 |
| 66 | 1.0580945 | 0.1459062 | 0.3636132 | 0.2774195 | versicolor | 2 |
| 70 | -0.2303600 | -1.2363631 | 0.0805233 | -0.1231327 | versicolor | 1 |
| 73 | 0.5895655 | -1.2363631 | 0.6467031 | 0.4109369 | versicolor | 1 |
| 75 | 0.7066978 | -0.3148502 | 0.3069952 | 0.1439021 | versicolor | 1 |
| 77 | 1.1752267 | -0.5452285 | 0.5900851 | 0.2774195 | versicolor | 1 |
| 78 | 1.0580945 | -0.0844720 | 0.7033211 | 0.6779717 | versicolor | 2 |
| 79 | 0.2381689 | -0.3148502 | 0.4202312 | 0.4109369 | versicolor | 1 |
| 81 | -0.3474923 | -1.4667413 | 0.0239054 | -0.1231327 | versicolor | 1 |
| 83 | 0.0039044 | -0.7756067 | 0.0805233 | 0.0103847 | versicolor | 1 |
| 84 | 0.2381689 | -0.7756067 | 0.7599390 | 0.5444543 | versicolor | 1 |
| 85 | -0.4646245 | -0.0844720 | 0.4202312 | 0.4109369 | versicolor | 1 |
| 86 | 0.2381689 | 0.8370409 | 0.4202312 | 0.5444543 | versicolor | 2 |
| 88 | 0.5895655 | -1.6971196 | 0.3636132 | 0.1439021 | versicolor | 1 |
| 89 | -0.2303600 | -0.0844720 | 0.1937593 | 0.1439021 | versicolor | 1 |
| 91 | -0.3474923 | -1.0059849 | 0.3636132 | 0.0103847 | versicolor | 1 |
| 92 | 0.3553011 | -0.0844720 | 0.4768492 | 0.2774195 | versicolor | 1 |
| 97 | -0.1132278 | -0.3148502 | 0.2503773 | 0.1439021 | versicolor | 1 |
| 99 | -0.8160212 | -1.2363631 | -0.4290384 | -0.1231327 | versicolor | 1 |
| 100 | -0.1132278 | -0.5452285 | 0.1937593 | 0.1439021 | versicolor | 1 |
| 102 | 0.0039044 | -0.7756067 | 0.7599390 | 0.9450066 | virginica | 1 |
| 103 | 1.5266234 | -0.0844720 | 1.2128828 | 1.2120414 | virginica | 2 |
| 104 | 0.5895655 | -0.3148502 | 1.0430289 | 0.8114891 | virginica | 2 |
| 105 | 0.8238300 | -0.0844720 | 1.1562649 | 1.3455588 | virginica | 2 |
| 106 | 2.1122845 | -0.0844720 | 1.6092087 | 1.2120414 | virginica | 2 |
| 107 | -1.0502856 | -1.2363631 | 0.4202312 | 0.6779717 | virginica | 1 |
| 108 | 1.7608878 | -0.3148502 | 1.4393547 | 0.8114891 | virginica | 2 |
| 109 | 1.0580945 | -1.2363631 | 1.1562649 | 0.8114891 | virginica | 1 |
| 110 | 1.6437556 | 1.2977973 | 1.3261188 | 1.7461110 | virginica | 2 |
| 111 | 0.8238300 | 0.3762844 | 0.7599390 | 1.0785240 | virginica | 2 |
| 114 | -0.1132278 | -1.2363631 | 0.7033211 | 1.0785240 | virginica | 1 |
| 115 | 0.0039044 | -0.5452285 | 0.7599390 | 1.6125936 | virginica | 1 |
| 117 | 0.8238300 | -0.0844720 | 0.9864109 | 0.8114891 | virginica | 2 |
| 118 | 2.2294167 | 1.7585538 | 1.6658266 | 1.3455588 | virginica | 2 |
| 120 | 0.2381689 | -1.9274978 | 0.7033211 | 0.4109369 | virginica | 1 |
| 122 | -0.2303600 | -0.5452285 | 0.6467031 | 1.0785240 | virginica | 1 |
| 124 | 0.5895655 | -0.7756067 | 0.6467031 | 0.8114891 | virginica | 1 |
| 125 | 1.0580945 | 0.6066626 | 1.0996469 | 1.2120414 | virginica | 2 |
| 126 | 1.6437556 | 0.3762844 | 1.2695008 | 0.8114891 | virginica | 2 |
| 127 | 0.4724333 | -0.5452285 | 0.5900851 | 0.8114891 | virginica | 1 |
| 128 | 0.3553011 | -0.0844720 | 0.6467031 | 0.8114891 | virginica | 1 |
| 129 | 0.7066978 | -0.5452285 | 1.0430289 | 1.2120414 | virginica | 2 |
| 132 | 2.4636812 | 1.7585538 | 1.4959727 | 1.0785240 | virginica | 2 |
| 133 | 0.7066978 | -0.5452285 | 1.0430289 | 1.3455588 | virginica | 2 |
| 134 | 0.5895655 | -0.5452285 | 0.7599390 | 0.4109369 | virginica | 1 |
| 136 | 2.2294167 | -0.0844720 | 1.3261188 | 1.4790762 | virginica | 2 |
| 137 | 0.5895655 | 0.8370409 | 1.0430289 | 1.6125936 | virginica | 2 |
| 138 | 0.7066978 | 0.1459062 | 0.9864109 | 0.8114891 | virginica | 2 |
| 141 | 1.0580945 | 0.1459062 | 1.0430289 | 1.6125936 | virginica | 2 |
| 147 | 0.5895655 | -1.2363631 | 0.7033211 | 0.9450066 | virginica | 1 |
#Remember me
ggplot(data=Clusterdata,mapping = aes(x=Sepal.Width,y=Petal.Width,color=Cluster))+geom_point(alpha=0.5)
ggplot(data=Clusterdata,mapping = aes(x=Sepal.Width,y=Petal.Width,color=Cluster))+geom_point(alpha=0.5)+facet_wrap(~Species)
ggplot(data=Clusterdata,mapping = aes(x=Sepal.Width,y=Petal.Width,color=Species))+geom_point(alpha=0.5) +
geom_point(data=as.data.frame(Clusters$centers), aes(color="Cluster center"), size=5) + theme(legend.title = element_blank())+ggtitle("Iris Cluster Demonstration")

Using the code above, answer the following questions.
Change the p=.6 to p=.75 in the [Data Pre-Processing] section. How did the classification results change?
Change the p=.6 to p=.4 in the [Data Pre-Processing] section. How did the classification results change?
Change the k hyper-parameter in the k-nn estimator to three (and ten). In other words change the 5 to 10 in tuneGrid=data.frame(k=5) in the [Classification]section. How did the classification results change?
Change the p=.6 to p=.75 in the [Data Pre-Processing] section. How did the regression results change?
Change the pcaComp hyper-parameter in the PCA code example to three (and four) in the [Dimensionality Reduction] section. What are the new explained variances?
Change the centers hyper-parameter in the cluster finding code example to two (and four) in the [Clustering]. Where are the new cluster centers? Does this look better or worse?
What does the set.seed function in R do? Why use it? Should we have used it above?
As the previous code cells demonstrate, we can generate machine learning models rather easily for small data sets by using the caret library. For larger data sets, however, either in the number of instances, the number of features, or both, building a quality machine learning model can take considerable time and effort. As a result, we may wish to persist a trained machine learning model so that it can be applied to new data at a later time.
#load our model
load(file = "DT_fit1.Rda")
Whenever you load it back you can use it just like before.
Regression
Regression is awesome
K-NN
K-NN is simple, but awesome
This lesson builds on the ordinary linear regression concept, introduced in business statistics, to discuss linear regression as a machine learning task. Regression, or the creation of a predictive model from data, is one of the key machine learning tasks. By using linear regression, which can often be solved analytically, we will introduce the concept of minimizing a cost function to determine the optimal model parameters in a straight-forward manner.
By the end of this lesson, you will be able to
You were introduced to the concept of linear regression by learning about simple linear regression. This initial approach treated linear regression as a statistical technique where the relation between independent variables (or features) and a dependent variable (or target) was determined by a mathematical relation. While powerful, the previous approach treated linear regression as a distinct, statistical approach to relating independent variables with a dependent variable. In this lesson, we instead treat linear regression as a machine learning task. As a result, we will use regression to fit a model to data. The model generated in this fashion can be explored in greater detail to either understand why the provided data follow the generated model (i.e., gain insight into the data), or the model can be used to generate new dependent values from future or unseen data (i.e., make predictions from the model).
We will use the tips data set. After loading this data, we display several rows, and next compute a simple linear regression to predict the tip feature from the total_bill feature.
For this section we need to load libraries:
head(tips,5)
# A tibble: 5 x 7
total_bill tip sex smoker day time size
<dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 17.0 1.01 Female No Sun Dinner 2
2 10.3 1.66 Male No Sun Dinner 3
3 21.0 3.5 Male No Sun Dinner 3
4 23.7 3.31 Male No Sun Dinner 2
5 24.6 3.61 Female No Sun Dinner 4
#view the whole dataset
knitr::kable(tips)%>%
kableExtra::kable_styling("striped")%>%
kableExtra::scroll_box(width = "100%",height="300px")
| total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|
| 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| 25.29 | 4.71 | Male | No | Sun | Dinner | 4 |
| 8.77 | 2.00 | Male | No | Sun | Dinner | 2 |
| 26.88 | 3.12 | Male | No | Sun | Dinner | 4 |
| 15.04 | 1.96 | Male | No | Sun | Dinner | 2 |
| 14.78 | 3.23 | Male | No | Sun | Dinner | 2 |
| 10.27 | 1.71 | Male | No | Sun | Dinner | 2 |
| 35.26 | 5.00 | Female | No | Sun | Dinner | 4 |
| 15.42 | 1.57 | Male | No | Sun | Dinner | 2 |
| 18.43 | 3.00 | Male | No | Sun | Dinner | 4 |
| 14.83 | 3.02 | Female | No | Sun | Dinner | 2 |
| 21.58 | 3.92 | Male | No | Sun | Dinner | 2 |
| 10.33 | 1.67 | Female | No | Sun | Dinner | 3 |
| 16.29 | 3.71 | Male | No | Sun | Dinner | 3 |
| 16.97 | 3.50 | Female | No | Sun | Dinner | 3 |
| 20.65 | 3.35 | Male | No | Sat | Dinner | 3 |
| 17.92 | 4.08 | Male | No | Sat | Dinner | 2 |
| 20.29 | 2.75 | Female | No | Sat | Dinner | 2 |
| 15.77 | 2.23 | Female | No | Sat | Dinner | 2 |
| 39.42 | 7.58 | Male | No | Sat | Dinner | 4 |
| 19.82 | 3.18 | Male | No | Sat | Dinner | 2 |
| 17.81 | 2.34 | Male | No | Sat | Dinner | 4 |
| 13.37 | 2.00 | Male | No | Sat | Dinner | 2 |
| 12.69 | 2.00 | Male | No | Sat | Dinner | 2 |
| 21.70 | 4.30 | Male | No | Sat | Dinner | 2 |
| 19.65 | 3.00 | Female | No | Sat | Dinner | 2 |
| 9.55 | 1.45 | Male | No | Sat | Dinner | 2 |
| 18.35 | 2.50 | Male | No | Sat | Dinner | 4 |
| 15.06 | 3.00 | Female | No | Sat | Dinner | 2 |
| 20.69 | 2.45 | Female | No | Sat | Dinner | 4 |
| 17.78 | 3.27 | Male | No | Sat | Dinner | 2 |
| 24.06 | 3.60 | Male | No | Sat | Dinner | 3 |
| 16.31 | 2.00 | Male | No | Sat | Dinner | 3 |
| 16.93 | 3.07 | Female | No | Sat | Dinner | 3 |
| 18.69 | 2.31 | Male | No | Sat | Dinner | 3 |
| 31.27 | 5.00 | Male | No | Sat | Dinner | 3 |
| 16.04 | 2.24 | Male | No | Sat | Dinner | 3 |
| 17.46 | 2.54 | Male | No | Sun | Dinner | 2 |
| 13.94 | 3.06 | Male | No | Sun | Dinner | 2 |
| 9.68 | 1.32 | Male | No | Sun | Dinner | 2 |
| 30.40 | 5.60 | Male | No | Sun | Dinner | 4 |
| 18.29 | 3.00 | Male | No | Sun | Dinner | 2 |
| 22.23 | 5.00 | Male | No | Sun | Dinner | 2 |
| 32.40 | 6.00 | Male | No | Sun | Dinner | 4 |
| 28.55 | 2.05 | Male | No | Sun | Dinner | 3 |
| 18.04 | 3.00 | Male | No | Sun | Dinner | 2 |
| 12.54 | 2.50 | Male | No | Sun | Dinner | 2 |
| 10.29 | 2.60 | Female | No | Sun | Dinner | 2 |
| 34.81 | 5.20 | Female | No | Sun | Dinner | 4 |
| 9.94 | 1.56 | Male | No | Sun | Dinner | 2 |
| 25.56 | 4.34 | Male | No | Sun | Dinner | 4 |
| 19.49 | 3.51 | Male | No | Sun | Dinner | 2 |
| 38.01 | 3.00 | Male | Yes | Sat | Dinner | 4 |
| 26.41 | 1.50 | Female | No | Sat | Dinner | 2 |
| 11.24 | 1.76 | Male | Yes | Sat | Dinner | 2 |
| 48.27 | 6.73 | Male | No | Sat | Dinner | 4 |
| 20.29 | 3.21 | Male | Yes | Sat | Dinner | 2 |
| 13.81 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 11.02 | 1.98 | Male | Yes | Sat | Dinner | 2 |
| 18.29 | 3.76 | Male | Yes | Sat | Dinner | 4 |
| 17.59 | 2.64 | Male | No | Sat | Dinner | 3 |
| 20.08 | 3.15 | Male | No | Sat | Dinner | 3 |
| 16.45 | 2.47 | Female | No | Sat | Dinner | 2 |
| 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 |
| 20.23 | 2.01 | Male | No | Sat | Dinner | 2 |
| 15.01 | 2.09 | Male | Yes | Sat | Dinner | 2 |
| 12.02 | 1.97 | Male | No | Sat | Dinner | 2 |
| 17.07 | 3.00 | Female | No | Sat | Dinner | 3 |
| 26.86 | 3.14 | Female | Yes | Sat | Dinner | 2 |
| 25.28 | 5.00 | Female | Yes | Sat | Dinner | 2 |
| 14.73 | 2.20 | Female | No | Sat | Dinner | 2 |
| 10.51 | 1.25 | Male | No | Sat | Dinner | 2 |
| 17.92 | 3.08 | Male | Yes | Sat | Dinner | 2 |
| 27.20 | 4.00 | Male | No | Thur | Lunch | 4 |
| 22.76 | 3.00 | Male | No | Thur | Lunch | 2 |
| 17.29 | 2.71 | Male | No | Thur | Lunch | 2 |
| 19.44 | 3.00 | Male | Yes | Thur | Lunch | 2 |
| 16.66 | 3.40 | Male | No | Thur | Lunch | 2 |
| 10.07 | 1.83 | Female | No | Thur | Lunch | 1 |
| 32.68 | 5.00 | Male | Yes | Thur | Lunch | 2 |
| 15.98 | 2.03 | Male | No | Thur | Lunch | 2 |
| 34.83 | 5.17 | Female | No | Thur | Lunch | 4 |
| 13.03 | 2.00 | Male | No | Thur | Lunch | 2 |
| 18.28 | 4.00 | Male | No | Thur | Lunch | 2 |
| 24.71 | 5.85 | Male | No | Thur | Lunch | 2 |
| 21.16 | 3.00 | Male | No | Thur | Lunch | 2 |
| 28.97 | 3.00 | Male | Yes | Fri | Dinner | 2 |
| 22.49 | 3.50 | Male | No | Fri | Dinner | 2 |
| 5.75 | 1.00 | Female | Yes | Fri | Dinner | 2 |
| 16.32 | 4.30 | Female | Yes | Fri | Dinner | 2 |
| 22.75 | 3.25 | Female | No | Fri | Dinner | 2 |
| 40.17 | 4.73 | Male | Yes | Fri | Dinner | 4 |
| 27.28 | 4.00 | Male | Yes | Fri | Dinner | 2 |
| 12.03 | 1.50 | Male | Yes | Fri | Dinner | 2 |
| 21.01 | 3.00 | Male | Yes | Fri | Dinner | 2 |
| 12.46 | 1.50 | Male | No | Fri | Dinner | 2 |
| 11.35 | 2.50 | Female | Yes | Fri | Dinner | 2 |
| 15.38 | 3.00 | Female | Yes | Fri | Dinner | 2 |
| 44.30 | 2.50 | Female | Yes | Sat | Dinner | 3 |
| 22.42 | 3.48 | Female | Yes | Sat | Dinner | 2 |
| 20.92 | 4.08 | Female | No | Sat | Dinner | 2 |
| 15.36 | 1.64 | Male | Yes | Sat | Dinner | 2 |
| 20.49 | 4.06 | Male | Yes | Sat | Dinner | 2 |
| 25.21 | 4.29 | Male | Yes | Sat | Dinner | 2 |
| 18.24 | 3.76 | Male | No | Sat | Dinner | 2 |
| 14.31 | 4.00 | Female | Yes | Sat | Dinner | 2 |
| 14.00 | 3.00 | Male | No | Sat | Dinner | 2 |
| 7.25 | 1.00 | Female | No | Sat | Dinner | 1 |
| 38.07 | 4.00 | Male | No | Sun | Dinner | 3 |
| 23.95 | 2.55 | Male | No | Sun | Dinner | 2 |
| 25.71 | 4.00 | Female | No | Sun | Dinner | 3 |
| 17.31 | 3.50 | Female | No | Sun | Dinner | 2 |
| 29.93 | 5.07 | Male | No | Sun | Dinner | 4 |
| 10.65 | 1.50 | Female | No | Thur | Lunch | 2 |
| 12.43 | 1.80 | Female | No | Thur | Lunch | 2 |
| 24.08 | 2.92 | Female | No | Thur | Lunch | 4 |
| 11.69 | 2.31 | Male | No | Thur | Lunch | 2 |
| 13.42 | 1.68 | Female | No | Thur | Lunch | 2 |
| 14.26 | 2.50 | Male | No | Thur | Lunch | 2 |
| 15.95 | 2.00 | Male | No | Thur | Lunch | 2 |
| 12.48 | 2.52 | Female | No | Thur | Lunch | 2 |
| 29.80 | 4.20 | Female | No | Thur | Lunch | 6 |
| 8.52 | 1.48 | Male | No | Thur | Lunch | 2 |
| 14.52 | 2.00 | Female | No | Thur | Lunch | 2 |
| 11.38 | 2.00 | Female | No | Thur | Lunch | 2 |
| 22.82 | 2.18 | Male | No | Thur | Lunch | 3 |
| 19.08 | 1.50 | Male | No | Thur | Lunch | 2 |
| 20.27 | 2.83 | Female | No | Thur | Lunch | 2 |
| 11.17 | 1.50 | Female | No | Thur | Lunch | 2 |
| 12.26 | 2.00 | Female | No | Thur | Lunch | 2 |
| 18.26 | 3.25 | Female | No | Thur | Lunch | 2 |
| 8.51 | 1.25 | Female | No | Thur | Lunch | 2 |
| 10.33 | 2.00 | Female | No | Thur | Lunch | 2 |
| 14.15 | 2.00 | Female | No | Thur | Lunch | 2 |
| 16.00 | 2.00 | Male | Yes | Thur | Lunch | 2 |
| 13.16 | 2.75 | Female | No | Thur | Lunch | 2 |
| 17.47 | 3.50 | Female | No | Thur | Lunch | 2 |
| 34.30 | 6.70 | Male | No | Thur | Lunch | 6 |
| 41.19 | 5.00 | Male | No | Thur | Lunch | 5 |
| 27.05 | 5.00 | Female | No | Thur | Lunch | 6 |
| 16.43 | 2.30 | Female | No | Thur | Lunch | 2 |
| 8.35 | 1.50 | Female | No | Thur | Lunch | 2 |
| 18.64 | 1.36 | Female | No | Thur | Lunch | 3 |
| 11.87 | 1.63 | Female | No | Thur | Lunch | 2 |
| 9.78 | 1.73 | Male | No | Thur | Lunch | 2 |
| 7.51 | 2.00 | Male | No | Thur | Lunch | 2 |
| 14.07 | 2.50 | Male | No | Sun | Dinner | 2 |
| 13.13 | 2.00 | Male | No | Sun | Dinner | 2 |
| 17.26 | 2.74 | Male | No | Sun | Dinner | 3 |
| 24.55 | 2.00 | Male | No | Sun | Dinner | 4 |
| 19.77 | 2.00 | Male | No | Sun | Dinner | 4 |
| 29.85 | 5.14 | Female | No | Sun | Dinner | 5 |
| 48.17 | 5.00 | Male | No | Sun | Dinner | 6 |
| 25.00 | 3.75 | Female | No | Sun | Dinner | 4 |
| 13.39 | 2.61 | Female | No | Sun | Dinner | 2 |
| 16.49 | 2.00 | Male | No | Sun | Dinner | 4 |
| 21.50 | 3.50 | Male | No | Sun | Dinner | 4 |
| 12.66 | 2.50 | Male | No | Sun | Dinner | 2 |
| 16.21 | 2.00 | Female | No | Sun | Dinner | 3 |
| 13.81 | 2.00 | Male | No | Sun | Dinner | 2 |
| 17.51 | 3.00 | Female | Yes | Sun | Dinner | 2 |
| 24.52 | 3.48 | Male | No | Sun | Dinner | 3 |
| 20.76 | 2.24 | Male | No | Sun | Dinner | 2 |
| 31.71 | 4.50 | Male | No | Sun | Dinner | 4 |
| 10.59 | 1.61 | Female | Yes | Sat | Dinner | 2 |
| 10.63 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 50.81 | 10.00 | Male | Yes | Sat | Dinner | 3 |
| 15.81 | 3.16 | Male | Yes | Sat | Dinner | 2 |
| 7.25 | 5.15 | Male | Yes | Sun | Dinner | 2 |
| 31.85 | 3.18 | Male | Yes | Sun | Dinner | 2 |
| 16.82 | 4.00 | Male | Yes | Sun | Dinner | 2 |
| 32.90 | 3.11 | Male | Yes | Sun | Dinner | 2 |
| 17.89 | 2.00 | Male | Yes | Sun | Dinner | 2 |
| 14.48 | 2.00 | Male | Yes | Sun | Dinner | 2 |
| 9.60 | 4.00 | Female | Yes | Sun | Dinner | 2 |
| 34.63 | 3.55 | Male | Yes | Sun | Dinner | 2 |
| 34.65 | 3.68 | Male | Yes | Sun | Dinner | 4 |
| 23.33 | 5.65 | Male | Yes | Sun | Dinner | 2 |
| 45.35 | 3.50 | Male | Yes | Sun | Dinner | 3 |
| 23.17 | 6.50 | Male | Yes | Sun | Dinner | 4 |
| 40.55 | 3.00 | Male | Yes | Sun | Dinner | 2 |
| 20.69 | 5.00 | Male | No | Sun | Dinner | 5 |
| 20.90 | 3.50 | Female | Yes | Sun | Dinner | 3 |
| 30.46 | 2.00 | Male | Yes | Sun | Dinner | 5 |
| 18.15 | 3.50 | Female | Yes | Sun | Dinner | 3 |
| 23.10 | 4.00 | Male | Yes | Sun | Dinner | 3 |
| 15.69 | 1.50 | Male | Yes | Sun | Dinner | 2 |
| 19.81 | 4.19 | Female | Yes | Thur | Lunch | 2 |
| 28.44 | 2.56 | Male | Yes | Thur | Lunch | 2 |
| 15.48 | 2.02 | Male | Yes | Thur | Lunch | 2 |
| 16.58 | 4.00 | Male | Yes | Thur | Lunch | 2 |
| 7.56 | 1.44 | Male | No | Thur | Lunch | 2 |
| 10.34 | 2.00 | Male | Yes | Thur | Lunch | 2 |
| 43.11 | 5.00 | Female | Yes | Thur | Lunch | 4 |
| 13.00 | 2.00 | Female | Yes | Thur | Lunch | 2 |
| 13.51 | 2.00 | Male | Yes | Thur | Lunch | 2 |
| 18.71 | 4.00 | Male | Yes | Thur | Lunch | 3 |
| 12.74 | 2.01 | Female | Yes | Thur | Lunch | 2 |
| 13.00 | 2.00 | Female | Yes | Thur | Lunch | 2 |
| 16.40 | 2.50 | Female | Yes | Thur | Lunch | 2 |
| 20.53 | 4.00 | Male | Yes | Thur | Lunch | 4 |
| 16.47 | 3.23 | Female | Yes | Thur | Lunch | 3 |
| 26.59 | 3.41 | Male | Yes | Sat | Dinner | 3 |
| 38.73 | 3.00 | Male | Yes | Sat | Dinner | 4 |
| 24.27 | 2.03 | Male | Yes | Sat | Dinner | 2 |
| 12.76 | 2.23 | Female | Yes | Sat | Dinner | 2 |
| 30.06 | 2.00 | Male | Yes | Sat | Dinner | 3 |
| 25.89 | 5.16 | Male | Yes | Sat | Dinner | 4 |
| 48.33 | 9.00 | Male | No | Sat | Dinner | 4 |
| 13.27 | 2.50 | Female | Yes | Sat | Dinner | 2 |
| 28.17 | 6.50 | Female | Yes | Sat | Dinner | 3 |
| 12.90 | 1.10 | Female | Yes | Sat | Dinner | 2 |
| 28.15 | 3.00 | Male | Yes | Sat | Dinner | 5 |
| 11.59 | 1.50 | Male | Yes | Sat | Dinner | 2 |
| 7.74 | 1.44 | Male | Yes | Sat | Dinner | 2 |
| 30.14 | 3.09 | Female | Yes | Sat | Dinner | 4 |
| 12.16 | 2.20 | Male | Yes | Fri | Lunch | 2 |
| 13.42 | 3.48 | Female | Yes | Fri | Lunch | 2 |
| 8.58 | 1.92 | Male | Yes | Fri | Lunch | 1 |
| 15.98 | 3.00 | Female | No | Fri | Lunch | 3 |
| 13.42 | 1.58 | Male | Yes | Fri | Lunch | 2 |
| 16.27 | 2.50 | Female | Yes | Fri | Lunch | 2 |
| 10.09 | 2.00 | Female | Yes | Fri | Lunch | 2 |
| 20.45 | 3.00 | Male | No | Sat | Dinner | 4 |
| 13.28 | 2.72 | Male | No | Sat | Dinner | 2 |
| 22.12 | 2.88 | Female | Yes | Sat | Dinner | 2 |
| 24.01 | 2.00 | Male | Yes | Sat | Dinner | 4 |
| 15.69 | 3.00 | Male | Yes | Sat | Dinner | 3 |
| 11.61 | 3.39 | Male | No | Sat | Dinner | 2 |
| 10.77 | 1.47 | Male | No | Sat | Dinner | 2 |
| 15.53 | 3.00 | Male | Yes | Sat | Dinner | 2 |
| 10.07 | 1.25 | Male | No | Sat | Dinner | 2 |
| 12.60 | 1.00 | Male | Yes | Sat | Dinner | 2 |
| 32.83 | 1.17 | Male | Yes | Sat | Dinner | 2 |
| 35.83 | 4.67 | Female | No | Sat | Dinner | 3 |
| 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
| 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
Perform simple linear regression
OLS1<-lm(formula=tip~total_bill,data=tips)
#general output
OLS1
Call:
lm(formula = tip ~ total_bill, data = tips)
Coefficients:
(Intercept) total_bill
0.9203 0.1050
#more common output
summary(OLS1)
Call:
lm(formula = tip ~ total_bill, data = tips)
Residuals:
Min 1Q Median 3Q Max
-3.1982 -0.5652 -0.0974 0.4863 3.7434
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.920270 0.159735 5.761 2.53e-08 ***
total_bill 0.105025 0.007365 14.260 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.022 on 242 degrees of freedom
Multiple R-squared: 0.4566, Adjusted R-squared: 0.4544
F-statistic: 203.4 on 1 and 242 DF, p-value: < 2.2e-16
#correlation or R-squared...
cor(tips$total_bill,tips$tip)^2
[1] 0.4566166
Formally, this simple linear model related the independent variables \(x_i\) to the dependent variables \(y_i\) in our data set via two parameters: an intercept, and a slope. Mathematically, we express this relation in the following form:
\[ f(x_i) = \beta * x_i + \alpha + \epsilon_i \]
where \(\epsilon_i\) accounts for the difference between the model and the data for each data point \((x_i,y_i)\). If we have a perfect model, these errors, \(\epsilon_i\), are all zero, and \(y_i = f(x_i)\). In real life, however, the error terms rarely vanish because even if the original relationship is perfect noise creeps into the measurement process.
As a result, in this simple example we wish to determine the model parameters: \(\beta_i\), and \(\alpha_i\) that minimize the values of \(\epsilon_i\). We could perform this process in an iterative manner, trying different values for the model parameters and measuring the error function. This approach is often used in machine learning, where we define a cost function that we seek to minimize by selecting the best model parameters.
In the case of a simple linear model, we have several potential cost (or loss) functions that we could seek to minimize, but we will use the common l2-norm: \(\epsilon_i^2 = \left( \ y_i - f(x_i) \ \right)^2\), where \(f(x_i)\) is defined by our model parameters. We demonstrate this approach visually in the following code block, where we minimize the sum of the l2-norm model residuals, which is done by finding the best model parameters: \(\hat{\beta}\), and \(\hat{\alpha}\).
Formulas 😢

#Get some data
AnsDat<-anscombe%>%
select(y1,x1)
#extract x and y columns
Y<-AnsDat$y1
X<-AnsDat$x1
#find the number of data points
n<-nrow(AnsDat)
#determine mean values
mean_x<-mean(X,na.rm = TRUE)
mean_y<-mean(Y,na.rm = TRUE)
#determine best fit model parameters (from simple linear regression)
beta = sum((X - mean_x) * (Y - mean_y)) / sum((X - mean_x)**2)
beta
[1] 0.5000909
alpha = mean_y - beta * mean_x
alpha
[1] 3.000091
Call:
lm(formula = Y ~ X, data = AnsDat)
Residuals:
Min 1Q Median 3Q Max
-1.92127 -0.45577 -0.04136 0.70941 1.83882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0001 1.1247 2.667 0.02573 *
X 0.5001 0.1179 4.241 0.00217 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
Plots
library(ggplot2)
#create regression plot
ggplot(AnsDat,aes(x1, y1)) +
geom_point() +
geom_smooth(method='lm', se=FALSE) +
geom_segment(aes(x=X, xend=X, y=Y, yend=lm(Y~X)$fitted.values, color="error"))+
theme_minimal() +
labs(x='X Values', y='Y Values', title='Linear Regression Plot') +
theme(plot.title = element_text(hjust=0.5, size=20, face='bold')) +
theme(legend.title = element_blank())

This simple example demonstrates a fundamental concept in machine learning, namely the minimization of a cost (or loss) function, which quantifies how well a model represents a data set. For a given data set, the cost function is completely specified by the model parameters, thus a more complex model has a more complex cost function, which can become difficult to minimize. To clarify this point, we now turn to the exploration of the shape of cost functions.
For simplicity, we start with a one-dimensional cost function, a linear model with no intercept: \(f(x_i) = \beta x_i\). In the following code cell, we compute the cost function for a given data set as a function of the unknown parameter \(\beta\). In this case, the minimum is easy to visualize, given the steepness of the cost function around the minimum.
library(ggplot2)
costplot<-as.data.frame(cbind(betas,l2n))
#create regression plot
ggplot(costplot,aes(betas, l2n)) +
geom_point(color="blue") + geom_line()+
geom_vline(xintercept=0, color="red")

In general, however, we face two challenges:
the cost function will likely be more complex, and
our data will be higher dimensional.
In general, we must employ a (potentially) complex mathematical technique to find the (hopefully) global minimum of the cost function. We can increase the complexity of our cost function analysis by extending the original model to include both a slope and an intercept. We now must find the minimum of this two dimensional model, given our observed data. We do this in the following code cell where we generate a grid of values in our two parameters, and compute the cost function for these different parameter combinations.
To display the data which generates a sampling grid across potential values for the slope \(\beta\) and intercept \(\alpha\) in our model. We once again vectorize our cost function and broadcast it across the sampling grid. We accumulate the cost at each grid point and generate a two-dimensional image of the values of the cost function across our sampling grid. To make the image appear cleaner, we perform Gaussian interpolation between sample points.
As the following two-dimensional image displays, our cost function is not aligned with either parameter, but is steeper in the slope parameter and less steep in the intercept parameter. Thus, we would expect that small changes in the slope will quickly increase our cost (which we saw in the previous one-dimensional example), while small changes in the intercept will produce smaller changes in our cost function (note that the range for intercepts is much larger than the range for the slope parameters).
#define our betas
betas<-seq(-4,4,length.out=100)
alphas<-seq(-40,40,length.out=100)
## Generate a grid of X- and Y- values on which to predict
grid <-expand.grid(betas,alphas)
#define our cost function
l2n2 = mapply( function(m,b) log(sqrt(sum((as.matrix(tips$tip) - m*as.matrix(tips$total_bill) - b)^2))),as.matrix(grid$Var1),as.matrix(grid$Var2)) # The L2-norm
library(ggplot2)
ggplot(grid, aes(Var1, Var2)) +
geom_raster(aes(fill=l2n2),show.legend = FALSE) +
geom_point(color="deepskyblue3",aes(OLS1$coefficients[[2]],OLS1$coefficients[[1]]))+
theme_minimal() +
labs(x=expression(beta), y=expression(alpha), title=expression(paste("Cost function for"," ",y==beta*x+alpha))) +
theme(plot.title = element_text(hjust=0.5, size=20, face='bold')) +
theme(legend.title = element_blank())

As we move to higher dimensional data sets or more complex cost functions, the challenge of finding the global minimum becomes increasingly difficult. As a result, many mathematical techniques have been developed to find the global minimum of a (potentially) complex function. The standard approach is gradient descent, where we use the fact that the first derivative (or gradient) measures the slope of a function at a given point. We can use the slope to infer which direction is downhill and thus travel (hopefully) towards the minimum.
A major challenge with this approach is the potential to become stuck in a local and not global minima. Thus, modifications are often added to reduce the likelihood of becoming stuck in a local minimum. One popular example of this approach is known as stochastic gradient descent. This algorithm employs standard gradient descent, but adds an occasional random jump in the parameter space to reduce the chances of being stuck in a local valley. Another, very different, approach to this problem is the use of genetic algorithms, which employ techniques from evolutionary biology to minimize the cost function.
For a mental picture of this process, imagine hiking in the mountains and flip the challenge to finding the highest peak, so we will use gradient ascent. Gradient ascent is similar to finding the local mountain peak and climbing it. This local peak might look like it is the largest, but a random jump away from the local peak might enable one to view much larger peaks beyond, which can subsequently be climbed with a new gradient ascent.
Whenever you perform machine learning in the future, you should keep in mind that the model that you generate for a given data set has generally resulted from the minimization of a cost function. Thus, there remains the possibility that with more effort, more data, or a better cost minimization strategy, a new, and better model may potentially exist.
#set the seed :)
set.seed(1)
#get our samples
#lets split the data 60/40
library(caret)
trainIndex <- createDataPartition(tips$tip, p = .6, list = FALSE, times = 1)
#look at the first few
#head(trainIndex)
#grab the data
tipsTrain <- tips[ trainIndex,]
tipsTest <- tips[-trainIndex,]
In the following code cells, we use the lm estimator to fit our sample data, plot the results, and finally display the fit coefficients.
The first code cell defines a function that will make two plots. The top plot is a comparison between a single independent variable (Total Bill) and the dependent variable (Tip). This plot differentiates the training data, the testing data, and the linear model. The bottom plot displays the model residuals (dependent variable - model result) as a function of the independent variable. The primary benefit of this plot is the ability to identify any structure in the residuals, which can indicate a bad model. For example, if the residual plot shows a linear relationship, that indicates the original model incorrectly related the independent and dependent variables.
In the following code cells, we first compute a linear fit with no intercept, after which we compute a linear fit with both a slope and an intercept. The fit results are displayed as well as the regression and residual plots.
The code below computes a regression with no intercept.
#fit simple linear regression model
model_noint <- lm(tip ~ 0+total_bill , data = tipsTrain)
noint_results<-predict(model_noint,tipsTest)
###compute fit
summary(model_noint)
Call:
lm(formula = tip ~ 0 + total_bill, data = tipsTrain)
Residuals:
Min 1Q Median 3Q Max
-3.7910 -0.1871 0.2019 0.7518 4.1204
Coefficients:
Estimate Std. Error t value Pr(>|t|)
total_bill 0.142010 0.004331 32.79 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.153 on 147 degrees of freedom
Multiple R-squared: 0.8797, Adjusted R-squared: 0.8789
F-statistic: 1075 on 1 and 147 DF, p-value: < 2.2e-16
| RMSE |
|---|
| 0.9808281 |
Below is the regression plot for the model with no intercept.
tipsTest$Sample<-"Testing"
tipsTrain$Sample<-"Training"
Combined_Tips<-rbind(tipsTest,tipsTrain)
#create regression plot with customized style
ggplot(Combined_Tips,aes(x=total_bill, y=tip,color=Sample)) +
geom_point(alpha=.5) +
theme_minimal() +
labs(x='X Values', y='Y Values', title='Linear Regression Plot') +
theme(plot.title = element_text(hjust=0.5, size=20, face='bold')) +
geom_abline(aes(slope=model_noint$coefficients[[1]],intercept=0),color="red")

Below is a residual (error) plot.
library(tidyverse)
#create residuals
testwithpred<-as.data.frame(cbind(noint_results,tipsTest))
#create residuals
testwithpred<-testwithpred%>%
rename(prediction=noint_results)%>%
mutate(error=tip-prediction)
#create regression plot with customized style
ggplot(testwithpred,aes(x=total_bill, y=error)) +
geom_point(alpha=.5,color="deepskyblue") +
theme_minimal() +
labs(x='Total Bill', y='Error', title='Regression Error Plot') +
theme(plot.title = element_text(hjust=0.25, size=20, face='bold')) +
geom_hline(yintercept=0,color="red",linetype="dashed")

Link to some good examples of interpreting residual plots 🏫
Model with an intercept
#fit simple linear regression model
model_int <- lm(tip ~ total_bill , data = tipsTrain)
int_results<-predict(model_int,tipsTest)
###compute fit
summary(model_int)
Call:
lm(formula = tip ~ total_bill, data = tipsTrain)
Residuals:
Min 1Q Median 3Q Max
-3.1243 -0.5729 -0.0703 0.4668 3.4083
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.018249 0.209997 4.849 3.14e-06 ***
total_bill 0.099788 0.009596 10.399 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.074 on 146 degrees of freedom
Multiple R-squared: 0.4255, Adjusted R-squared: 0.4216
F-statistic: 108.1 on 1 and 146 DF, p-value: < 2.2e-16
| RMSE |
|---|
| 0.9408934 |
Below is a model with an intercept
#create regression plot with customized style
ggplot(Combined_Tips,aes(x=total_bill, y=tip,color=Sample)) +
geom_point(alpha=.5) +
theme_minimal() +
labs(x='X Values', y='Y Values', title='Linear Regression Plot') +
theme(plot.title = element_text(hjust=0.5, size=20, face='bold')) +
geom_abline(aes(slope=model_int$coefficients[[2]],intercept=model_int$coefficients[[1]]),color="red")

Residual Plot
#create residuals
testwithpred2<-as.data.frame(cbind(int_results,tipsTest))
#create residuals
testwithpred2<-testwithpred2%>%
rename(prediction=int_results)%>%
mutate(error=tip-prediction)
#create regression plot with customized style
ggplot(testwithpred2,aes(x=total_bill, y=error)) +
geom_point(alpha=.5,color="deepskyblue") +
theme_minimal() +
labs(x='Total Bill', y='Error', title='Regression Error Plot') +
theme(plot.title = element_text(hjust=0.25, size=20, face='bold')) +
geom_hline(yintercept=0,color="red",linetype="dashed")

Often, using more data will result in more accurate models, since finer details can be captured. For example, if we see structure in a residual plot, the easiest solution is often to add additional independent variables to our model, which results in a multivariate linear regression model. The only major change required to our previous model building code is the expansion of our equation in the lm() function to include the additional independent variables.
To demonstrate building a multi-variate regression model, the following code uses both the total_bill and size features from the tips data set to use as independent variables. The tip feature is used as the dependent variable.
The following code generates a multi-variate linear model, displays the model parameters, and displays the regression and residual plots. To make the regression plot, we must use only one feature (in this case the total_bill). As a result, when we display the generated model, we get a series of lines that are the projections of the multi-variate model on this two-dimensional figure.
#fit simple linear regression model
model_multi <- lm(tip ~ total_bill+size , data = tipsTrain)
multi_results<-predict(model_multi,tipsTest)
###compute fit
summary(model_multi)
Call:
lm(formula = tip ~ total_bill + size, data = tipsTrain)
Residuals:
Min 1Q Median 3Q Max
-2.5272 -0.5926 -0.0678 0.5428 3.3375
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.46842 0.24272 1.930 0.055576 .
total_bill 0.07291 0.01135 6.425 1.77e-09 ***
size 0.41764 0.10448 3.997 0.000102 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.023 on 145 degrees of freedom
Multiple R-squared: 0.4825, Adjusted R-squared: 0.4754
F-statistic: 67.6 on 2 and 145 DF, p-value: < 2.2e-16
| RMSE |
|---|
| 1.039275 |
library(plotly)
scatter.plot<-plot_ly(Combined_Tips, x = ~total_bill, y = ~size, z = ~tip, color = ~Sample, colors = c('lightblue', 'violet'))%>%
add_markers(size=6)%>%
layout(scene = list(
xaxis = list(title = 'Total Bill'),
yaxis = list(title = 'Size'),
zaxis = list(title = 'Tip')))
#scatter.plot
library(reshape2)
#Graph Resolution (more important for more complex shapes)
graph_reso <- 0.05
#Setup Axis
axis_x <- seq(min(tipsTest$total_bill), max(tipsTest$total_bill), by = graph_reso)
axis_y <- seq(min(tipsTest$size), max(tipsTest$size), by = graph_reso)
#Sample points
lm_surface <- expand.grid(total_bill = axis_x,size = axis_y,KEEP.OUT.ATTRS = F)
lm_surface$tips <- predict(model_multi, newdata = lm_surface)
lm_surface <- acast(lm_surface, size ~ total_bill, value.var = "tips")
scatter.plot<- add_trace(p = scatter.plot,
z = lm_surface,
x = axis_x,
y = axis_y,
type = "surface",colorscale = list(c(0, 1), c("wheat", "royalblue")))%>%
layout(legend = list(x = -.1, y = 1), title="Regression Plot")
scatter.plot
#create residuals
testwithpred3<-as.data.frame(cbind(multi_results,tipsTest))
#create residuals
testwithpred3<-testwithpred3%>%
rename(prediction=multi_results)%>%
mutate(error=tip-prediction)
library(plotly)
scatter.plot<-plot_ly(testwithpred3, x = ~total_bill, y = ~size, z = ~error)%>%
add_markers(size=6)%>%
layout(scene = list(
xaxis = list(title = 'Total Bill'),
yaxis = list(title = 'Size'),
zaxis = list(title = 'Error')))
#scatter.plot
library(reshape2)
#Graph Resolution (more important for more complex shapes)
graph_reso <- 0.05
#Setup Axis
axis_x <- seq(min(tipsTest$total_bill), max(tipsTest$total_bill), by = graph_reso)
axis_y <- seq(min(tipsTest$size), max(tipsTest$size), by = graph_reso)
#Sample points
lm_surface <- expand.grid(total_bill = axis_x,size = axis_y,KEEP.OUT.ATTRS = F)
lm_surface$error <-rep(0,nrow(lm_surface))
lm_surface <- acast(lm_surface, size ~ total_bill, value.var = "error")
scatter.plot<- add_trace(p = scatter.plot,
z = lm_surface,
x = axis_x,
y = axis_y,
type = "surface",
colorscale = list(c(0,1), c("red","red")))%>%
layout(showlegend=FALSE, title= "Error Plot")
hide_colorbar(scatter.plot)
total_bill with tip. What is the RSquared? What is the RMSE?total_bill with tip and size. What is the RSquared? What is the RMSE?Many data sets contain features that are non-numerical. For example, the tips data set contains a day feature that can take one of four values: Thur, Fri, Sat, and Sun. This data set also contains a sex feature that can be Female or Male, and a smoker feature that can be No or Yes. Each of these features are categorical features, in that they can only take on one of a limited number of possible values. In general, the possible states are fixed, such as the sex, smoker, and day features discussed previously.
Categorical features can take several forms. For example, a categorical feature, such as sex or smoker that can take on one of two values is known as a binary feature. Furthermore, categorical features can also be categorized into nominal and ordinal features (note that other classes are also possible, but beyond the scope of this class).
A nominal feature either is in a category or it isn’t, and there are no relations between the different categories. For example, the sex category is nominal since there is no numerical relation or ordering among the possible values. On the other hand, an ordinal feature is a categorical feature where the possible values have an intrinsic relationship. For example, if we encode the results of a race as first, second, and third, these values have a relationship, in that first comes before the other two, and the difference between first and second is the same as between second and third. In our tips example, we could treat the day features in this manner, since the days often are treated as having an ordinal relationship.
df = data.frame(Color = c("Red", "Blue", "Green", "Blue", "Blue", "Red"))
knitr::kable(df)
| Color |
|---|
| Red |
| Blue |
| Green |
| Blue |
| Blue |
| Red |
This encoding is fine if the data are ordinal, but in this case, our colors are likely nominal and there is no numerical relationship between the different features. Thus, we need to perform an additional transformation to convert our data into a numerical format that a machine learning model can effectively process. To do this, a commonly used approach known as One Hot Encoding is used. This approach generates a new feature for each possible value in our category. Thus, for our four colors, we need four features. These features will be binary, in that a value of zero indicates that the feature is not present for the specific instance, and a value of one indicates it is present. Furthermore, only one set of these new features can be present (or on) for a specific instance.
library(varhandle)
dumvars<-as.data.frame(to.dummy(df$Color,"dum"))
knitr::kable(dumvars)
| dum.Blue | dum.Green | dum.Red |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 0 | 0 | 1 |
We fit the model, display the fit coefficients, compute the model performance, and finally display the regression model plot and the residual model plot. In this case, our new model performs slightly worse than the original single variable linear regression model. This suggests that the day of the week is not an important variable in the underlying relationship between total_bill and tip. By evaluating other feature combinations, you may be able to find a better predicting model.
dumvars_Train<-as.data.frame(to.dummy(tipsTrain$day,"dum"))
dumvars_Train<-cbind(dumvars_Train,tipsTrain)
#fit simple linear regression model
model_dum1 <- lm(tip ~total_bill+ dum.Sat+dum.Sun+dum.Thur+dum.Fri , data = dumvars_Train)
dumvars_Test<-as.data.frame(to.dummy(tipsTest$day,"dum"))
dumvars_Test<-cbind(dumvars_Test,tipsTest)
dum1_results<-predict(model_dum1,dumvars_Test)
###compute fit
summary(model_dum1)
Call:
lm(formula = tip ~ total_bill + dum.Sat + dum.Sun + dum.Thur +
dum.Fri, data = dumvars_Train)
Residuals:
Min 1Q Median 3Q Max
-3.0514 -0.5612 -0.0356 0.4624 3.2878
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.023612 0.333353 3.071 0.00256 **
total_bill 0.098554 0.010186 9.675 < 2e-16 ***
dum.Sat -0.037771 0.330383 -0.114 0.90914
dum.Sun 0.124046 0.343880 0.361 0.71884
dum.Thur -0.004739 0.333255 -0.014 0.98867
dum.Fri NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.083 on 143 degrees of freedom
Multiple R-squared: 0.4276, Adjusted R-squared: 0.4116
F-statistic: 26.71 on 4 and 143 DF, p-value: < 2.2e-16
| RMSE |
|---|
| 0.9394876 |
#fit simple linear regression model
model_dum2 <- lm(tip ~ total_bill+factor(day) , data = tipsTrain)
dum2_results<-predict(model_dum2,tipsTest)
###compute fit
summary(model_dum2)
Call:
lm(formula = tip ~ total_bill + factor(day), data = tipsTrain)
Residuals:
Min 1Q Median 3Q Max
-3.0514 -0.5612 -0.0356 0.4624 3.2878
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.023612 0.333353 3.071 0.00256 **
total_bill 0.098554 0.010186 9.675 < 2e-16 ***
factor(day)Sat -0.037771 0.330383 -0.114 0.90914
factor(day)Sun 0.124046 0.343880 0.361 0.71884
factor(day)Thur -0.004739 0.333255 -0.014 0.98867
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.083 on 143 degrees of freedom
Multiple R-squared: 0.4276, Adjusted R-squared: 0.4116
F-statistic: 26.71 on 4 and 143 DF, p-value: < 2.2e-16
| RMSE |
|---|
| 0.9394876 |
stargazer::stargazer(model_dum2,title="Fancy Reg Table",
type = "html",
float = TRUE,
report = "vcs*",
no.space = TRUE,
header=FALSE,
single.row = TRUE,
#font.size = "small",
intercept.bottom = F)
| Dependent variable: | |
| tip | |
| Constant | 1.024 (0.333)*** |
| total_bill | 0.099 (0.010)*** |
| factor(day)Sat | -0.038 (0.330) |
| factor(day)Sun | 0.124 (0.344) |
| factor(day)Thur | -0.005 (0.333) |
| Observations | 148 |
| R2 | 0.428 |
| Adjusted R2 | 0.412 |
| Residual Std. Error | 1.083 (df = 143) |
| F Statistic | 26.705*** (df = 4; 143) |
| Note: | p<0.1; p<0.05; p<0.01 |
Regression Plot
#create regression plot with customized style
ggplot(Combined_Tips,aes(x=total_bill, y=tip,color=Sample)) +
geom_point(alpha=.5) +
theme_minimal() +
labs(x='total bill', y='tip', title='Linear Regression Plot') +
theme(plot.title = element_text(hjust=0.5, size=20, face='bold')) +
geom_abline(aes(slope=model_dum2$coefficients[[2]],
intercept=model_int$coefficients[[1]]),color="red")+
geom_abline(aes(slope=model_dum2$coefficients[[2]],
intercept=model_int$coefficients[[1]]+model_dum2$coefficients[[3]]),color="red")+
geom_abline(aes(slope=model_dum2$coefficients[[2]],
intercept=model_int$coefficients[[1]]+model_dum2$coefficients[[4]]),color="red")+
geom_abline(aes(slope=model_dum2$coefficients[[2]],
intercept=model_int$coefficients[[1]]+model_dum2$coefficients[[5]]),color="red")

Residual Plot
#create residuals
testwithpred4<-as.data.frame(cbind(dum2_results,tipsTest))
#create residuals
testwithpred4<-testwithpred4%>%
rename(prediction=dum2_results)%>%
mutate(error=tip-prediction)
#create regression plot with customized style
ggplot(testwithpred4,aes(x=total_bill, y=error)) +
geom_point(alpha=.5,color="deepskyblue") +
theme_minimal() +
labs(x='Total Bill', y='Error', title='Regression Error Plot') +
theme(plot.title = element_text(hjust=0.25, size=20, face='bold')) +
geom_hline(yintercept=0,color="red",linetype="dashed")

tip feature from the total_bill and categorical day features. Repeat this process, but use the total_bill, size, sex, and time features. Has the prediction performance improved, i.e., what is the RSquared and RMSE?The is borrowed content from https://insights.principa.co.za/4-types-of-data-analytics-descriptive-diagnostic-predictive-prescriptive↩︎
This is a narrow definition because you can also impute with summary statistics such as mean or median.↩︎
Content for this caret portion is borrowed from https://www.rebeccabarter.com/blog/2017-11-17-caret_tutorial/↩︎
https://www.r-bloggers.com/2013/11/computing-and-visualizing-pca-in-r/↩︎